This week problem set will explore behavior of support vector classifiers and SVMs (following the distinction made in ISLR) on banknote authentication dataset from UCI ML archive. We worked with it on multiple occasions before (most recently two weeks ago evaluating performance of logistic regression, discriminant analysis and KNN on it):
dbaDat <- read.table("data_banknote_authentication.txt",sep=",")
colnames(dbaDat) <- c("var","skew","curt","entr","auth")
dbaDat$auth <- factor(dbaDat$auth)
dim(dbaDat)
## [1] 1372 5
summary(dbaDat)
## var skew curt entr
## Min. :-7.0421 Min. :-13.773 Min. :-5.2861 Min. :-8.5482
## 1st Qu.:-1.7730 1st Qu.: -1.708 1st Qu.:-1.5750 1st Qu.:-2.4135
## Median : 0.4962 Median : 2.320 Median : 0.6166 Median :-0.5867
## Mean : 0.4337 Mean : 1.922 Mean : 1.3976 Mean :-1.1917
## 3rd Qu.: 2.8215 3rd Qu.: 6.815 3rd Qu.: 3.1793 3rd Qu.: 0.3948
## Max. : 6.8248 Max. : 12.952 Max. :17.9274 Max. : 2.4495
## auth
## 0:762
## 1:610
##
##
##
##
head(dbaDat)
## var skew curt entr auth
## 1 3.62160 8.6661 -2.8073 -0.44699 0
## 2 4.54590 8.1674 -2.4586 -1.46210 0
## 3 3.86600 -2.6383 1.9242 0.10645 0
## 4 3.45660 9.5228 -4.0112 -3.59440 0
## 5 0.32924 -4.4552 4.5718 -0.98880 0
## 6 4.36840 9.6718 -3.9606 -3.16250 0
pairs(dbaDat[,1:4],col=as.numeric(dbaDat$auth))
Here we will use SVM implementation available in library e1071 to fit classifiers with linear and radial (polynomial for extra points) kernels and compare their relative performance as well as to that of random forest and KNN.
Use svm from library e1071 with kernel="linear" to fit classifier (e.g. ISLR Ch.9.6.1) to the entire banknote authentication dataset setting parameter cost to 0.001, 1, 1000 and 1 mln. Describe how this change in parameter cost affects model fitting process (hint: the difficulty of the underlying optimization problem increases with cost – can you explain what drives it?) and its outcome (how does the number of support vectors change with cost?) and what are the implications of that. Explain why change in cost value impacts number of support vectors found. (Hint: there is an answer in ISLR.) Use tune function from library e1071 (see ISLR Ch.9.6.1 for details and examples of usage) to determine approximate value of cost (in the range between 0.1 and 100 – the suggested range spanning ordes of magnitude should hint that the density of the grid should be approximately logarithmic – e.g. 1, 3, 10, … or 1, 2, 5, 10, … etc.) that yields the lowest error in cross-validation employed by tune. Setup a resampling procedure repeatedly splitting entire dataset into training and test, using training data to tune cost value and test dataset to estimate classification error. Report and discuss distributions of test errors from this procedure and selected values of cost.
costs <- c(0.001, 1, 1000, 1e7)
for (cost in costs) {
svmFit <- svm(auth ~ ., data = dbaDat, kernel = "linear", cost = cost, scale = TRUE)
svmCM <- table(dbaDat$auth, predict(svmFit, dbaDat))
print(paste("Cost = ", cost))
print(svmCM)
}
## [1] "Cost = 0.001"
##
## 0 1
## 0 737 25
## 1 119 491
## [1] "Cost = 1"
##
## 0 1
## 0 742 20
## 1 1 609
## [1] "Cost = 1000"
##
## 0 1
## 0 752 10
## 1 6 604
## [1] "Cost = 1e+07"
##
## 0 1
## 0 744 18
## 1 119 491
Accuracy improves on the jump from 0.001 to 1 cost, but drops above that. In addition, the time it takes to fit the higher costs is much longer than for the low costs.
costs <- c(0.1, 0.3, 0.5, 1, 3, 10, 30, 100)
k <- 10
kfInd <- sample(rep(c(1:k),length.out=nrow(dbaDat)))
xvalResultsLinSVM <- tibble()
for (fold in 1:k) {
dbaTrain <- dbaDat[kfInd != fold, ]
dbaTest <- dbaDat[kfInd == fold, ]
tune.out <- tune(svm, auth ~ ., data = dbaTrain, kernel = "linear", scale = TRUE,
ranges = list(cost = costs))
svmFit <- svm(auth ~ ., data = dbaDat, kernel = "linear", scale = TRUE,
cost = tune.out$best.parameters$cost)
svmCM <- table(dbaTest$auth, predict(svmFit, dbaTest))
xvalResultsLinSVM <- rbind(xvalResultsLinSVM, tibble(
classifier = "linsvm",
fold = fold,
best_cost = tune.out$best.parameters$cost,
err = 1 - sum(diag(svmCM)) / sum(svmCM)
))
}
ggplot(xvalResultsLinSVM, aes(x = best_cost)) + geom_bar()
ggplot(xvalResultsLinSVM, aes(x = classifier, y = err)) + geom_boxplot()
Typically, a cost of 3 is chosen as the best, but about 1/3 of the time 10 is chosen instead.
Fit random forest classifier on the entire banknote authentication dataset with default parameters. Calculate resulting misclassification error as reported by the confusion matrix in random forest output. Explain why error reported in random forest confusion matrix represents estimated test (as opposed to train) error of the procedure. Compare resulting test error to that for support vector classifier obtained above and discuss results of such comparison.
rfFit <- randomForest(dbaDat %>% select(-auth), dbaDat$auth)
rfCM <- table(dbaDat$auth, predict(rfFit))
rfCM
##
## 0 1
## 0 756 6
## 1 4 606
The way random forest is built is already bootstrapped, so additional cross-validation is not necessary. It appears the performance is very good, even slightly better than the SVM classifier.
rfErr <- 1 - sum(diag(rfCM)) / sum(rfCM)
rfErr
## [1] 0.00728863
xvalResultsRF <- tibble(
classifier = "rf",
err = rfErr
)
Use convenience wrapper tune.knn provided by the library e1071 on the entire dataset to determine optimal value for the number of the nearest neighbors ‘k’ to be used in KNN classifier. Consider our observations from week 9 problem set when choosing range of values of k to be evaluated by tune.knn. Setup resampling procedure similar to that used above for support vector classifier that will repeatedly: a) split banknote authentication dataset into training and test, b) use tune.knn on training data to determine optimal k, and c) use k estimated by tune.knn to make KNN classifications on test data. Report and discuss distributions of test errors from this procedure and selected values of k, compare them to those obtained for random forest and support vector classifier above.
ks <- c(3, 10, 30, 100, 300)
k <- 10
kfInd <- sample(rep(c(1:k),length.out=nrow(dbaDat)))
xvalResultsKNN <- tibble()
for (fold in 1:k) {
dbaTrain <- dbaDat[kfInd != fold, ]
dbaTest <- dbaDat[kfInd == fold, ]
tune.out <- tune.knn(select(dbaTrain, -auth), dbaTrain$auth, k = ks)
knnFit <- knn(select(dbaTrain, -auth), select(dbaTest, -auth), dbaTrain$auth, k = tune.out$best.parameters$k)
knnCM <- table(dbaTest$auth, predict(svmFit, dbaTest))
xvalResultsKNN <- rbind(xvalResultsKNN, tibble(
fold = fold,
classifier = "knn",
best_k = tune.out$best.parameters$k,
err = 1 - sum(diag(knnCM)) / sum(knnCM)
))
}
ggplot(xvalResultsKNN, aes(x = best_k)) + geom_bar()
ggplot(xvalResultsKNN, aes(x = classifier, y = err)) + geom_boxplot()
Accuracy between the best knn (with best k, usually 3 and sometimes 10) and the best svm with linear kernal is pretty similar. Most get only one or two classifications incorrect in a test set.
Plot SVM model fit to the banknote authentication dataset using (for the ease of plotting) only variance and skewness as predictors variables, kernel="radial", cost=1 and gamma=1 (see ISLR Ch.9.6.2 for an example of that done with a simulated dataset). You should be able to see in the resulting plot the magenta-cyan classification boundary as computed by this model. Produce the same kinds of plots using 0.01 and 100 as values of gamma also. Compare classification boundaries between these three plots and describe how they are impacted by the change in the value of gamma. Can you trace it back to the role of gamma in the equation introducing it with the radial kernel in ISLR?
gammas <- c(0.01, 0.1, 1, 10, 100)
for (gamma in gammas) {
svmFit <- svm(auth ~ ., data = dbaDat, kernel = "radial", cost = 1, gamma = gamma, scale = TRUE)
plot(svmFit, dbaDat, skew ~ var)
title(sub = paste("gamma =", gamma))
}
As gamma increases, the area of space where points would be classified as “1” shrinks. At first the classification boundary is nearly linear, but it quickly becomes round and shrinks in size. I believe the gamma parameter is tuning how far away the “influence” of the “1” points reaches. As gamma gets higher and higher, the “1” classification area shrinks smaller and smaller into the maximum “influence” area of “1” classification.
Similar to how it was done above for support vector classifier (and KNN), set up a resampling process that will repeatedly: a) split the entire dataset (using all attributes as predictors) into training and test datasets, b) use tune function to determine optimal values of cost and gamma and c) calculate test error using these values of cost and gamma. You can start with cost=c(1,2,5,10,20) and gamma=c(0.01,0.02,0.05,0.1,0.2) as starting ranges to evaluate by tune, but please feel free to experiment with different sets of values and discuss the results of it and how you would go about selecting those ranges starting from scratch. Present resulting test error graphically, compare it to that of support vector classifier (with linear kernel), random forest and KNN classifiers obtained above and discuss results of these comparisons.
costs <- c(0.1, 0.3, 0.5, 1, 3, 10, 30, 100)
gammas <- c(0.01, 0.02, 0.05, 0.1, 0.2)
k <- 10
kfInd <- sample(rep(c(1:k),length.out=nrow(dbaDat)))
xvalResultsRadSVM <- tibble()
for (fold in 1:k) {
dbaTrain <- dbaDat[kfInd != fold, ]
dbaTest <- dbaDat[kfInd == fold, ]
tune.out <- tune(svm, auth ~ ., data = dbaTrain, kernel = "radial", scale = TRUE,
ranges = list(cost = costs, gamma = gammas))
svmFit <- svm(auth ~ ., data = dbaDat, kernel = "radial", scale = TRUE,
cost = tune.out$best.parameters$cost, gamma = tune.out$best.parameters$gamma)
svmCM <- table(dbaTest$auth, predict(svmFit, dbaTest))
xvalResultsRadSVM <- rbind(xvalResultsRadSVM, tibble(
fold = fold,
classifier = "radsvm",
best_cost = tune.out$best.parameters$cost,
best_gamma = tune.out$best.parameters$gamma,
err = 1 - sum(diag(svmCM)) / sum(svmCM)
))
}
ggplot(xvalResultsRadSVM, aes(x = best_cost)) + geom_bar()
ggplot(xvalResultsRadSVM, aes(x = best_gamma)) + geom_bar()
ggplot(xvalResultsRadSVM, aes(x = classifier, y = err)) + geom_boxplot()
The radial basis svm is perfect! Zero percent error. Clearly the ability of the radial basis function to have non-linear classification boundary is a huge boon to its classification performance. Below, I provide a graphical comparison of all of the techniques analyzed in this problem set.
Repeat what was done above (plots of decision boundaries for various interesting values of tuning parameters and test error for their best values estimated from training data) using kernel="polynomial". Determine ranges of coef0, degree, cost and gamma to be evaluated by tune. Present and discuss resulting test error and how it compares to linear and radial kernels and those of random forest and KNN.
gammas <- c(0.01, 0.1, 1, 10)
for (gamma in gammas) {
svmFit <- svm(auth ~ ., data = dbaDat, kernel = "radial", scale = TRUE,
cost = 1, gamma = gamma, degree = 2, coef0 = 0)
plot(svmFit, dbaDat, skew ~ var)
title(sub = paste("gamma =", gamma))
}
costs <- c(1, 3, 10, 30, 100)
for (cost in costs) {
svmFit <- svm(auth ~ ., data = dbaDat, kernel = "radial", scale = TRUE,
cost = cost, gamma = 1, degree = 2, coef0 = 0)
plot(svmFit, dbaDat, skew ~ var)
title(sub = paste("cost =", cost))
}
degrees <- c(1, 2, 3)
for (degree in degrees) {
svmFit <- svm(auth ~ ., data = dbaDat, kernel = "radial", scale = TRUE,
cost = 10, gamma = 0.1, degree = degree, coef0 = 0)
plot(svmFit, dbaDat, skew ~ var)
title(sub = paste("degree =", degree))
}
coef0s <- c(0, 0.5, 1)
for (coef0 in coef0s) {
svmFit <- svm(auth ~ ., data = dbaDat, kernel = "radial", scale = TRUE,
cost = 10, gamma = 0.1, degree = 2, coef0 = coef0)
plot(svmFit, dbaDat, skew ~ var)
title(sub = paste("coef0 =", coef0))
}
costs <- c(10, 30, 100)
gammas <- c(0.08, 0.1, 0.12)
degrees <- c(2, 3)
coef0s <- c(0.005, 0.01, 0.03)
k <- 10
kfInd <- sample(rep(c(1:k),length.out=nrow(dbaDat)))
xvalResultsPolySVM <- tibble()
for (fold in 1:k) {
dbaTrain <- dbaDat[kfInd != fold, ]
dbaTest <- dbaDat[kfInd == fold, ]
tune.out <- tune(svm, auth ~ ., data = dbaTrain, kernel = "polynomial", scale = TRUE,
ranges = list(cost = costs, gamma = gammas, degree = degrees, coef0 = coef0s))
svmFit <- svm(auth ~ ., data = dbaDat, kernel = "polynomial", scale = TRUE,
cost = tune.out$best.parameters$cost)
svmCM <- table(dbaTest$auth, predict(svmFit, dbaTest))
xvalResultsPolySVM <- rbind(xvalResultsPolySVM, tibble(
fold = fold,
classifier = "polysvm",
best_cost = tune.out$best.parameters$cost,
best_gamma = tune.out$best.parameters$gamma,
best_degree = tune.out$best.parameters$degree,
best_coef0 = tune.out$best.parameters$coef0,
err = 1 - sum(diag(svmCM)) / sum(svmCM)
))
}
ggplot(xvalResultsPolySVM, aes(x = best_cost)) + geom_bar()
ggplot(xvalResultsPolySVM, aes(x = best_gamma)) + geom_bar()
ggplot(xvalResultsPolySVM, aes(x = best_degree)) + geom_bar()
ggplot(xvalResultsPolySVM, aes(x = best_coef0)) + geom_bar()
ggplot(xvalResultsPolySVM, aes(x = classifier, y = err)) + geom_boxplot()
xvalResultsComplete <- bind_rows(
xvalResultsLinSVM,
xvalResultsRadSVM,
xvalResultsPolySVM,
xvalResultsKNN,
xvalResultsRF
)
ggplot(xvalResultsComplete, aes(x = classifier, y = err)) + geom_boxplot()
All told, the radial basis svm was the best classifier, posting a 0% error rate for all ten simulations. After tuning, the polynomial svm is second best with a 0% error most of the time and occasional ~0.7% error. Random forest performs fairly well, and the linear svm and knn are the worst but still very good performers with usually a 1-2% error rate.